Performance analysis of hybrid deep learning framework using a vision transformer and convolutional neural network for handwritten digit recognition

Digitization has created a demand for highly efficient handwritten document recognition systems. A handwritten document consists of digits, text, symbols, diagrams, etc., and digits are an essential element of such documents. Accurate recognition of handwritten digits is vital for effective communication and data analysis. Various researchers have attempted to address this problem with modern convolutional neural network (CNN) techniques. However, despite their high recognition accuracy, CNN filter weights remain fixed after training, so the process cannot flexibly adapt to input changes. Computer vision researchers have therefore recently become interested in Vision Transformers (ViTs) and Multilayer Perceptrons (MLPs). The shortcomings of CNNs gave rise to hybrid models that combine the best elements of both fields. This paper analyzes how a hybrid convolutional ViT model affects the ability to recognize handwritten digits. Because real-world data contains noise, distortions, and varying writing styles, both cleaned and uncleaned handwritten digit images are used for evaluation. The accuracy of the proposed method is compared with state-of-the-art techniques, and the results show that the proposed model achieves the highest recognition accuracy. Probable approaches to recognizing other aspects of handwritten documents are also discussed.
• Analyzed the effect of a convolutional vision transformer on cleaned and real-world handwritten digit images.
• The model's performance improved with cross-validation and hyper-parameter tuning.
• The results show that the proposed model is robust, feasible, and effective on cleaned and uncleaned handwritten digits.


Method details
The construction of the vision transformer is depicted in Fig. 1 [1]. The transformer receives embedded patches as input. The embeddings are constructed by dividing the image into patches of equal size; padding accounts for any disparity between the patch size and the image size. Since the transformer requires a 1D sequence as input, each patch is projected onto a 1D vector, and the position of each patch is learned through a positional embedding. Each layer is subjected to layer normalization [2]. Eq. (1) gives the layer normalization of a 1D sequence vector $v$:

$\mathrm{LN}(v) = \gamma \odot \dfrac{v - \mu}{\sigma} + \beta$  (1)

where $\mu$ is the mean of the elements of $v$, $\sigma$ is the standard deviation of the elements of $v$, $\gamma$ is a scaling parameter, and $\beta$ is a bias vector parameter.
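To make Eq. (1) concrete, the sketch below writes it out directly and checks it against PyTorch's built-in layer normalization. This is a minimal illustration with an assumed embedding dimension of 768, not the authors' code.

```python
import torch

def layer_norm(v, gamma, beta, eps=1e-5):
    # Eq. (1): normalize v by its mean and standard deviation, then scale and shift
    mu = v.mean(dim=-1, keepdim=True)                    # mean of the elements of v
    sigma = v.std(dim=-1, keepdim=True, unbiased=False)  # standard deviation of v
    return gamma * (v - mu) / (sigma + eps) + beta

v = torch.randn(4, 768)                                  # four embedded patches
out = layer_norm(v, torch.ones(768), torch.zeros(768))
ref = torch.nn.functional.layer_norm(v, (768,))          # PyTorch reference
assert torch.allclose(out, ref, atol=1e-4)
```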
Self-attention is the mechanism employed in multi-head attention. A query (Q), key (K), and value (V) are provided as the input to self-attention, which maps Q and the K-V pairs to an output. The softmax function calculates the weights applied to the value vectors. Attention is defined in Eq. (2) [3]:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$  (2)

where $d_k$ is the dimension of the key vectors.
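The following is a minimal single-head sketch of Eq. (2); the sequence length and embedding size are illustrative assumptions.

```python
import math
import torch

def attention(Q, K, V):
    # Eq. (2): scaled dot-product attention
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # Q K^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)            # softmax over keys
    return weights @ V                                 # weighted sum of the values

# In self-attention, Q, K, and V all come from the same patch sequence.
x = torch.randn(49, 64)     # 49 patch embeddings of dimension 64
out = attention(x, x, x)    # shape (49, 64)
```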
The multi-head attention layer splits the input into smaller chunks and calculates the attention of each head independently and in parallel. The Gaussian error linear unit (GeLU) activation function is used in the two-layer feed-forward MLP layer, as shown in Eq. (3) [4]:

$\mathrm{GeLU}(x) = x\,\Phi(x)$  (3)

where $\Phi(x)$ denotes the Gaussian cumulative distribution function.
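The feed-forward block can be sketched as below; the 768/3072 widths are common ViT defaults used purely for illustration, not necessarily the configuration of the proposed model.

```python
import torch.nn as nn

# Two-layer feed-forward MLP with GeLU, as in Eq. (3)
mlp = nn.Sequential(
    nn.Linear(768, 3072),  # expand the embedding dimension
    nn.GELU(),             # GeLU(x) = x * Phi(x), with Phi the Gaussian CDF
    nn.Linear(3072, 768),  # project back to the embedding dimension
)
```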
The largest historical handwritten digit dataset (DIDA) [5,6] has 250,000 images, and the Extended Modified National Institute of Standards and Technology (EMNIST) digits dataset [7] comprises 280,000 images. This study employs the hybrid convolutional vision transformer (CViT) [8] to recognize handwritten digits.
The images are first resized so that every image in the dataset has the same size before being passed to the model; this improves the effectiveness of the model. The images are then standardized using the mean and standard deviation. This normalization step, also called rescaling, makes it easy to apply the same procedure to all images, and because all images are scaled identically, a constant learning rate can be used.
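A minimal sketch of this preprocessing pipeline, assuming torchvision transforms; the target size and the mean/std values are illustrative placeholders rather than the exact settings used in the experiments.

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((28, 28)),                  # make every image the same size
    transforms.ToTensor(),                        # rescale pixel values to [0, 1]
    transforms.Normalize(mean=[0.5], std=[0.5]),  # standardize with mean and std
])
```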
Wu et al. [8] proposed the CViT architecture, which consists of three stages. We adapted the transformer's parameters to the handwritten digit datasets; in particular, the value of the depth parameter was modified to increase model performance and accuracy. The CNN serves two primary purposes: generating the embedded patches and extracting features.
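To make the CNN's embedding role concrete, the sketch below shows one common way a convolutional stem can produce the patch embeddings for the transformer; this illustrates the idea rather than the exact CViT stem, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ConvPatchEmbed(nn.Module):
    def __init__(self, in_ch=1, embed_dim=64, patch=4):
        super().__init__()
        # A strided convolution extracts local features and emits one
        # embedding per patch, replacing a plain linear projection.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                    # x: (B, 1, 28, 28)
        x = self.proj(x)                     # (B, 64, 7, 7)
        return x.flatten(2).transpose(1, 2)  # (B, 49, 64): 1D patch sequence

tokens = ConvPatchEmbed()(torch.randn(8, 1, 28, 28))  # (8, 49, 64)
```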

Datasets
The EMNIST-digit and DIDA benchmark datasets are used for experimentation. There are 280,000 images in the EMNIST-digit dataset, of which 252,000 are used for training and 28,000 for testing. The images used for training and testing change after each fold, as cross-validation is used. The dataset is split into ten balanced classes, numbered 0 through 9. Fig. 2 shows sample images from this dataset. The EMNIST-digit images have been size-normalized, denoised, and cleaned; DIDA, a different benchmark dataset, is therefore employed to evaluate the proposed model on uncleaned images. The DIDA dataset consists of 252,860 images cropped from historical documents, and its digits vary in writing style, size, orientation, stroke width, and layout. DIDA also contains ten classes. Fig. 3 displays an example image of each class.
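The fold rotation can be sketched as follows, assuming scikit-learn's KFold over the image indices; the data loaders themselves are omitted.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(280_000)  # EMNIST-digit images
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kfold.split(indices)):
    # membership of the train and test sets changes after each fold
    print(f"fold {fold}: {len(train_idx)} train, {len(test_idx)} test")
```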

Model parameters and performance metrics
Choosing optimal hyper-parameter values for the framework is an especially crucial step. Optuna [9] is a hyper-parameter optimization framework for machine learning and black-box optimization. As a black-box optimizer, Optuna evaluates the behavior of the hyper-parameters. The Optuna trial object specifies the type and range of each hyper-parameter to be tuned. The optimizer is selected from Adam, RMSprop, and SGD; the learning rate is sampled log-uniformly between 0.00001 and 0.1; and the batch size ranges from 16 to 256 in steps of 16.
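A minimal sketch of this search using Optuna's standard API; `build_model` and `train_and_evaluate` are hypothetical placeholders for the CViT training pipeline.

```python
import optuna

def objective(trial):
    # Search space as described above: optimizer choice, log-uniform
    # learning rate in [1e-5, 0.1], and batch size 16..256 in steps of 16
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "RMSprop", "SGD"])
    learning_rate = trial.suggest_float("lr", 1e-5, 0.1, log=True)
    batch_size = trial.suggest_int("batch_size", 16, 256, step=16)

    model = build_model()  # hypothetical model constructor
    return train_and_evaluate(model, optimizer_name, learning_rate, batch_size)

study = optuna.create_study(direction="maximize")  # maximize validation accuracy
study.optimize(objective, n_trials=50)
print(study.best_params)
```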
For the EMNIST dataset, Optuna suggested the stochastic gradient descent (SGD) optimizer with a learning rate of 0.01 and a batch size of 192; the ideal dropout value found was 0.25. The PyTorch library is used for experimentation, with classification performed via the logarithmic softmax function. The negative log-likelihood (NLL) loss is used to calculate the loss.
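The loss computation can be sketched as below; the linear layer stands in for the CViT classification head, and the feature size is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

head = nn.Linear(64, 10)                                 # stand-in classification head
optimizer = torch.optim.SGD(head.parameters(), lr=0.01)  # values suggested by Optuna

x = torch.randn(192, 64)                                 # batch size 192
targets = torch.randint(0, 10, (192,))                   # ten digit classes

log_probs = F.log_softmax(head(x), dim=1)                # logarithmic softmax
loss = F.nll_loss(log_probs, targets)                    # negative log-likelihood loss
loss.backward()
optimizer.step()
```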

Results and discussion
Due to the sparse data, the experiment uses K-fold cross-validation with ten folds to prevent overfitting. On the EMNIST-digit dataset, the overall 10-fold accuracy is 99.89 %, and the accuracy, precision, recall, and F1-score are all 99.89 %. A maximum accuracy of 99.95 % was attained in the final fold. The model's ROC AUC score is 99.99 %. Fig. 4 shows results for the EMNIST dataset, displaying a few examples with their actual labels, predicted labels, and images; it also demonstrates that the framework can accurately predict rotated images. The final fold's confusion matrix is depicted in Fig. 5, where the count of accurate predictions per class is shown in blue. Fig. 6 shows the class-wise precision, recall, and F-measure analysis. The proposed model classifies digits 0, 1, 2, 5, 6, and 8 more accurately than 3, 4, 7, and 9.
To determine the impact of the model on original images, experiments are also conducted on the DIDA dataset. The average recall, accuracy, precision, and F1-score are all 99.81 %. The confusion matrix for the DIDA dataset is depicted in Fig. 7.
Fig. 8 shows the DIDA dataset's class-wise precision, recall, and F-measure analysis. As the DIDA dataset is imbalanced, the difference in class-wise classification is visible in Fig. 8. The writing styles of 1 and 7 are somewhat similar; hence, they are difficult to classify.
The obtained ROC AUC score is 99.99 %. The outcomes of the prediction on a few sample images are displayed in Fig. 9. The model can therefore identify digits in original images.
The fold-wise accuracy on both datasets, i.e., EMNIST and DIDA, is shown in Fig. 10. Although the DIDA images are uncleaned, the proposed method achieved accuracy comparable to that on the cleaned EMNIST dataset. Fig. 10 also shows that the model is not overfitting.
An efficient model must be capable of differentiating among the digit classes. Hence, Fig. 11 compares the class-wise F-measure between the EMNIST and DIDA datasets. Fig. 11 clearly shows that the model is robust and can distinguish digits effectively.
Sensitivity measures the framework's ability to correctly identify positive cases, while specificity measures its ability to correctly identify negative cases. The class-wise sensitivity and specificity were analyzed for both datasets. Table 1 compares the performance of the proposed technique with different architectures on the EMNIST dataset. The proposed methodology achieved higher accuracy than the state-of-the-art methods; thus, vision transformers can be used for handwriting recognition tasks. The model's results on uncleaned images also imply that the attention mechanism is robust to noise and effective across different writing styles. Hence, in real-time scenarios it can recognize handwritten data such as pin codes on postal letters and dates in historical documents.
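Class-wise sensitivity and specificity follow directly from the multi-class confusion matrix, as in this minimal sketch with illustrative labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])  # illustrative labels
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])
cm = confusion_matrix(y_true, y_pred)

tp = np.diag(cm)              # correct predictions per class
fn = cm.sum(axis=1) - tp      # missed members of each class
fp = cm.sum(axis=0) - tp      # other classes predicted as this class
tn = cm.sum() - (tp + fn + fp)

sensitivity = tp / (tp + fn)  # true-positive rate per class
specificity = tn / (tn + fp)  # true-negative rate per class
```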

Future research directions
The proposed model can be generalized to recognize multilingual handwritten digits and symbols, as shown in Fig. 14. The solid line in Fig. 14 indicates the proposed work, and the dashed line indicates future experimentation with the proposed model. Recognition can be achieved by training the proposed model on multilingual and symbol datasets.

Table 1
Proposed framework accuracy compared to other state-of-the-art methods on the EMNIST-digit dataset.
Integrating the proposed method with a graph attention network (GAT) would enable recognition of handwritten mathematical expressions. The model can also be further enhanced to recognize hand-drawn flowcharts, electrical circuit diagrams, molecular symbols, etc. This would be especially useful in distance education, where students learn online with the help of recent technologies.

Fig. 2. Digit samples from the EMNIST dataset.
Fig. 3. Sample digit images from each class of the DIDA dataset.


Fig. 14. Future research directions of the proposed model.